Practical Implementations of Principal Component Analysis

Nick Belgau, Oscar Hernandez Mata

2024-08-01

Introduction

  • PCA is a dimensionality reduction technique that retains most of the information in the data within a new, transformed feature space.

  • If variables initially show high correlation, that indicates redundant information that PCA can filter out.

  • A technique for extracting insights from complex datasets.

  • Principal components are linear combinations of the original variables, expressed in a new, reduced coordinate system with axes aligned along the directions of maximum variability.

(Rahayu et al. 2017), (Joshi and Patil 2020)

Advantages

  • Eliminates multicollinearity.

  • Simpler models - improves generalization, reduces overfitting, improves prediction efficiency.

  • Filters noise and irrelevant variations.

Limitations

  • Assumes linearity - if nonlinear relationships exist, less effective.

  • Sensitive to outliers - impacts contribution from each feature.

(Altman and Krzywinski 2018), (Bharadiya 2023), (Joshi and Patil 2020), (Ali, Wassif, and Bayomi 2024)

  • Explored a few different applications:
  1. Dimensionality reduction of a tabular dataset - demographic census data

  2. Image classification - SVM and CNN modeling

  • Although PCA can be used for many real world applications like signal processing and image compression, sometimes there is a better tool for the job.

  • This presentation will highlight when it may or may not be an appropriate application for PCA.

The Math behind PCA

SVD is faster and more numerically stable than eigen-decomposition

  • To reduce the dataset, some linear algebra is required

  • Although PCA is traditionally taught through eigen-decomposition of the covariance matrix, Singular Value Decomposition (SVD) is typically used in practice because it eliminates the need to calculate and handle the covariance matrix.

  • Numerical stability.

  • Efficient with large datasets.

sklearn.decomposition.PCA

stats::prcomp()

(Johnson and Wichern 2023)

Decomposition
- Ensure features are continuous and standardized (\(\mu\) = 0, \(\sigma\) = 1), then decompose the \(X\) data: \[ X = U \Sigma V^T \]

  • Each column of \(V\) represents a principal component (PC); the PCs are orthogonal to each other and define the new axes.

Calculate explained variance
- The diagonal entries \(\sigma_i\) of the singular value matrix \(\Sigma\) correspond to the strength of each PC:
\[ \text{variance explained}_i = \frac{\sigma_i^2}{\sum_j \sigma_j^2} \]

Dimensionality reduction:
- Select PCs based on cumulative explained variance target (95%) and truncate \(V\).
- Transform \(X\) into the new feature space, effectively reducing the dimensions: \[ X_{\text{transformed}} = X V_{\text{selected}} \]
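The full pipeline above (standardize, decompose via SVD, select PCs, project) can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data with one deliberately redundant feature; all variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)  # two nearly redundant features

# Standardize each feature: mu = 0, sigma = 1
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Thin SVD of the standardized data: Xs = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)

# Explained variance ratios come from the squared singular values
var_ratio = s**2 / np.sum(s**2)

# Keep the smallest number of PCs reaching 95% cumulative explained variance
k = int(np.searchsorted(np.cumsum(var_ratio), 0.95)) + 1

# Project onto the selected PCs (the first k rows of Vt, i.e. columns of V)
X_transformed = Xs @ Vt[:k].T
print(X.shape, "->", X_transformed.shape)
```

Because two features are nearly redundant, fewer than five components are needed to reach the 95% target; `sklearn.decomposition.PCA` and `stats::prcomp()` wrap this same computation.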

The effectiveness of PCA relies on satisfying these points:

  1. Linearity
  • PCA assumes that the resulting principal components are linear combinations of the original variables. Nonlinear relationships may produce low covariance values, leading to an undervalued representation of their significance.
  2. Continuous data
  • Standardizing the data requires continuous features.

  • Handle categorical variables separately.

  3. Data standardization
  • Scaling standardizes the variance of each variable to ensure equal contributions.

  • Mean-centering has a similar impact: ensuring PCs capture the direction of maximum variance.

  • Outliers can also distort the PCs, so they should be identified and handled appropriately.

Application 1 - Demographic Data

  • This application was inspired by a paper published by UWF (Amin, Yacko, and Guttmann 2018) on Alzheimer’s disease mortality.

  • US census data: contains demographic, health, and environmental metrics.

Column
Obesity Age Adj
Smoking Rate
Diabetes
Heart Disease
Cancer
Food Index
Poverty Percent
Physical Inactivity
Mercury TPY
Lead TPY
Atrazine High KG

(Tejada-Vera 2013) (Amin, Yacko, and Guttmann 2018)

Continuous Variables

  • Data types are numeric with high cardinality.
  • Scaling and mean-centering will be needed.
skim(data)
Data summary
Name data
Number of rows 1143
Number of columns 11
_______________________
Column type frequency:
numeric 11
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
obesity_age_adj 0 1 31.65 3.76 19.00 29.16 31.26 33.95 46.92 ▁▆▇▂▁
Smoking_Rate 0 1 25.06 3.56 10.72 22.94 25.62 27.68 32.86 ▁▁▃▇▂
Diabetes 0 1 10.64 1.62 6.48 9.32 10.56 11.69 17.92 ▂▇▆▁▁
Heart_Disease 0 1 126.68 38.46 41.20 99.90 120.10 146.85 279.20 ▂▇▃▁▁
Cancer 0 1 187.78 26.55 75.33 170.25 188.26 204.40 370.64 ▁▇▆▁▁
Mercury_TPY 0 1 0.02 0.07 0.00 0.00 0.00 0.00 0.94 ▇▁▁▁▁
Lead_TPY 0 1 0.14 0.30 0.00 0.01 0.04 0.14 2.80 ▇▁▁▁▁
Food_index 0 1 6.60 1.30 0.00 5.90 6.70 7.40 10.00 ▁▁▃▇▁
Poverty_Percent 0 1 19.21 6.67 0.00 14.90 18.80 23.15 47.70 ▁▇▇▁▁
Atrazine_High_KG 0 1 4531.16 24239.04 0.00 94.50 632.80 3480.10 768660.60 ▇▁▁▁▁
SUNLIGHT 0 1 17689.04 1037.82 15389.96 16897.70 17723.25 18285.74 21671.87 ▃▇▆▂▁

Linearity Analysis

  • Harvey-Collier Test: statistical test for pairwise linearity testing.

  • Asymmetry arises because the recursive residuals differ between the X~Y and Y~X regressions.

(Maureen, Oyinebifun, and Christopher 2022), (Harvey and Collier 1977)

Visually checking scatter plots is not a realistic way to inspect linearity in real-world, high-dimensional applications, and correlation plots alone are insufficient.

  • No transformations were applied; some information loss is acknowledged.

Outliers Analysis

  • Outliers can inflate eigenvalues and PCs.

  • PCA does not require normality, but a roughly normal distribution minimizes impact from outliers.

  • Distribution (Before):
                   Kurtosis   Skewness
Atrazine_High_KG 866.377802 27.6792372
Mercury_TPY       72.724213  7.4402444
Lead_TPY          34.286296  5.0314443
Cancer             5.489565  0.3562768
Food_index         4.878495 -0.8643685
Heart_Disease      3.867908  0.8405355
Poverty_Percent    3.795263  0.4674652
obesity_age_adj    3.680773  0.2879323
Smoking_Rate       3.531763 -0.7543274
Diabetes           3.344971  0.5427094
SUNLIGHT           2.974585  0.3380277
  • Box Cox transformation (After):

(“Compression of Spectral Data Using Box-Cox Transformation” 2014)
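A minimal scipy sketch of the Box-Cox transformation on a synthetic heavy-tailed variable (the actual census columns differ; kurtosis is reported non-excess, matching the convention in the table above):

```python
import numpy as np
from scipy.stats import boxcox, kurtosis, skew

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # strictly positive, heavy right tail

# boxcox() estimates the optimal lambda by maximum likelihood
x_bc, lam = boxcox(x)

# fisher=False reports plain (non-excess) kurtosis, ~3 for a normal distribution
print(f"lambda = {lam:.3f}")
print(f"skewness: {skew(x):.2f} -> {skew(x_bc):.2f}")
print(f"kurtosis: {kurtosis(x, fisher=False):.2f} -> {kurtosis(x_bc, fisher=False):.2f}")
```

Note that Box-Cox requires strictly positive data; columns containing zeros (e.g. Mercury_TPY) would need a shift or a Yeo-Johnson transform instead.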

Multicollinearity Analysis

Correlation

  • While not a complete diagnosis, identifying highly correlated variables can gauge the effectiveness of PCA.

Variation Inflation Factor (VIF)

  • Consider a scenario where y=‘Heart_Disease’.

  • Multicollinearity between ‘obesity_age_adj’ and ‘Diabetes’ (both VIF > 4) could be eliminated by PCA.

                      VIF
obesity_age_adj  4.321218
Smoking_Rate     1.812287
Diabetes         4.009033
Cancer           1.640111
Mercury_TPY      2.098779
Lead_TPY         2.143279
Food_index       2.159996
Poverty_Percent  2.228750
Atrazine_High_KG 1.085929
SUNLIGHT         1.227205

(Altman and Krzywinski 2018)

Principal Component Analysis

  • Mean-centering and scaling were applied at the time of PCA execution.
pca_result <- prcomp(data_transform, center=TRUE, scale.=TRUE)
  • What’s more important - info retention or dimensionality reduction?
  • Common practice is to aim for 70-95%, but depends on the application.
  • The first few components should capture the majority of variance:
       Std Dev Eigenvalues Proportion Cumulative Variance
PC1  1.9273424   3.7146486    0.33770             0.33770
PC2  1.3267963   1.7603883    0.16004             0.49773
PC3  1.2316793   1.5170338    0.13791             0.63564
PC4  0.9782909   0.9570531    0.08700             0.72265
PC5  0.9144745   0.8362637    0.07602             0.79867
PC6  0.7843754   0.6152447    0.05593             0.85460
PC7  0.7010809   0.4915145    0.04468             0.89929
PC8  0.6479885   0.4198892    0.03817             0.93746
PC9  0.5455382   0.2976120    0.02706             0.96451
PC10 0.5085014   0.2585737    0.02351             0.98802
PC11 0.3630131   0.1317785    0.01198             1.00000
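In Python, scikit-learn performs this cumulative-variance selection directly when `n_components` is given as a float. A sketch on synthetic low-rank data (not the census dataset):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
latent = rng.normal(size=(1000, 3))            # 3 true underlying factors
loadings = rng.normal(size=(3, 11))
X = latent @ loadings + 0.05 * rng.normal(size=(1000, 11))

Xs = StandardScaler().fit_transform(X)         # mean-center and scale, as prcomp() does

# A float in (0, 1) keeps the fewest PCs reaching that cumulative variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(Xs)
print(pca.n_components_, round(pca.explained_variance_ratio_.sum(), 4))
```

Because the data were generated from three latent factors, three components suffice to clear the 95% threshold.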

Scree Plot

  • Elbow at PC4, indicating a point of diminishing returns.
  • The first four components might be a sufficient summary of the data.
plot(pca_result, type = "l", col = "#215B9D", lwd = 2)

Eigenvectors (loadings)

  • PC1: Health and lifestyle factors; increases in obesity, smoking, diabetes, heart disease, and cancer are correlated with poorer health.

  • PC2: Environmental exposure; higher Mercury and Lead decrease the component score, indicating negative impacts.

  • PC3: Food and poverty; decreased access to quality food correlates with increased poverty.

  • PC4: Chemicals and cancer; potential link from agricultural Atrazine.

                         PC1         PC2         PC3          PC4
obesity_age_adj   0.45539911 -0.08923142  0.04616352 -0.007211041
Smoking_Rate      0.37885662  0.04248172 -0.28743323  0.128820773
Diabetes          0.43751888 -0.11214803  0.07923426  0.080961617
Heart_Disease     0.23446146  0.05725530 -0.35130391 -0.119261323
Cancer            0.35550386 -0.02396108 -0.27508389  0.182844585
Mercury_TPY      -0.07090470 -0.66554141 -0.12306134  0.194309805
Lead_TPY         -0.06505621 -0.68349434 -0.02491826  0.092634475
Food_index       -0.32036082  0.06299314 -0.49456146 -0.040732247
Poverty_Percent   0.37456589  0.02855038  0.35864042 -0.028913891
Atrazine_High_KG  0.09967273 -0.22949368 -0.02370473 -0.936369219
SUNLIGHT         -0.11906414 -0.07901314  0.56599160  0.059355967

Biplot

  • A biplot visualizes the first 2 PCs, showing the data projection and how each variable contributes to the PC (arrow magnitude and direction).

  • Variables oriented in the same or exactly opposite directions are correlated - a redundancy in information.

Biplot (cont.)

  • Food_index is negatively correlated with obesity_age_adj and Diabetes, which contributes redundant information to the dataset.
  • Positively correlated groups due to the small angle between the vectors: Mercury and Lead; and all the health metrics.

Application 2: Image Classification

  • PCA is frequently touted as an effective technique to improve image classification tasks through dimensionality reduction.

  • In this application, the effectiveness of PCA was evaluated for a Support Vector Machine (SVM) algorithm and compared to the results from a modern Convolutional Neural Network (CNN).

  • A 10,000-image subset of the CIFAR-10 dataset was used, covering 10 labeled classes.

(Li et al. 2012)

Principal Component Analysis (PCA)

  • Normalized to have pixel values between 0 and 1.

  • Training, validation, and test sets were created.

  • After flattening, PCA was applied to retain 95% of the explained variance.
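A sketch of this pipeline, with random low-rank data standing in for the normalized, flattened CIFAR-10 images (which we do not download here; real images compress even better because neighboring pixels are highly correlated):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, h, w, c = 500, 32, 32, 3  # CIFAR-10-shaped: 32x32 RGB

# Low-rank stand-in for flattened images, scaled into [0, 1]
base = rng.normal(size=(n, 40)) @ rng.normal(size=(40, h * w * c))
X_flat = (base - base.min()) / (base.max() - base.min())

# In the real pipeline: X_flat = images.reshape(len(images), -1) / 255.0
pca = PCA(n_components=0.95)  # retain 95% of the explained variance
X_pca = pca.fit_transform(X_flat)
print(X_flat.shape, "->", X_pca.shape)
```

The 3,072 flattened pixel features collapse to a few dozen components, which is what makes the downstream SVM so much faster.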

Modeling: Support Vector Machine (SVM)

  • Traditionally, SVMs have been used for image classification (here with a nonlinear RBF kernel).

  • Compared SVM to SVM with PCA-reduced data.

  • PCA model achieved similar accuracy but 10x faster prediction time.

          Model  Accuracy  Prediction Time (s), n=100
0           SVM     0.536                       11.96
1  SVM with PCA     0.533                        1.19
  • This demonstrates the effectiveness of dimensionality reduction in speeding up predictions without compromising accuracy.
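The comparison can be reproduced at small scale with scikit-learn's built-in 8x8 digit images standing in for CIFAR-10 (a sketch, not the original experiment):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 8x8 grayscale digit images as a small stand-in dataset
X, y = load_digits(return_X_y=True)
X = X / 16.0  # normalize pixel values into [0, 1]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Baseline: nonlinear RBF-kernel SVM on the raw pixels
svm = SVC(kernel="rbf").fit(X_tr, y_tr)

# Same SVM on PCA-reduced features retaining 95% of the variance
pca = PCA(n_components=0.95).fit(X_tr)
svm_pca = SVC(kernel="rbf").fit(pca.transform(X_tr), y_tr)

acc, acc_pca = svm.score(X_te, y_te), svm_pca.score(pca.transform(X_te), y_te)
print(f"raw ({X.shape[1]} dims): {acc:.3f}   PCA ({pca.n_components_} dims): {acc_pca:.3f}")
```

As in the table above, accuracy is essentially unchanged while the feature dimension (and hence kernel evaluation cost) shrinks substantially.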

Modeling: Convolutional Neural Network

  • Recognize situations where PCA may not be the optimal choice.

  • CNNs have become the gold standard for image classification tasks.

  • PCA is typically not used before a CNN because flattening the data destroys its spatial structure.

(Goel, Goel, and Kumar 2023)

  • 9-layer CNN with Conv2D layers and a ‘softmax’ output for multi-class probabilities.

  • Model architecture addresses overfitting by using Dropout and MaxPooling2D.

model_cnn = Sequential([
    Input(shape=(32, 32, 3)),
    Conv2D(32, 3, padding='valid', activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.25),
    Conv2D(64, 3, activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.25),
    Conv2D(128, 3, activation='relu'),
    Flatten(),
    Dense(64, activation='relu'),
    Dropout(0.50),
    Dense(10, activation='softmax'),
])

Model Evaluation

  • Evaluations on test data.

  • Prediction time was 5x faster than the SVM with PCA model, with a model size of 2.6 MB and accuracy over 70%.


          Model  Accuracy  Prediction Time (s), n=100  Model Size (MB)
0           SVM     0.536                       11.96            894.7
1  SVM with PCA     0.533                        1.19             65.2
2           CNN     0.705                        0.23              2.6
3     CNN Tuned     0.736                        0.19             10.1
  • These qualities make CNNs a great choice for real-time image classification, deployed directly on IoT devices without prior dimensionality reduction.

Conclusion

  • PCA simplifies data through dimensionality reduction by transforming original variables into uncorrelated principal components.

  • Benefits: captures patterns, addresses multicollinearity and overfitting, and enhances computational efficiency and model performance.

  • Challenges: Sensitive to outliers and may lose important information in nonlinear relationships.

  • Application #1 - Reduced the dimensions of tabular US census data, grouping demographic features into buckets like health and environmental factors.

  • Application #2 - Applied PCA to image classification and identified when PCA may not be the best technique.

  • PCA is a really helpful tool to have in your toolbox because it can be used in a wide range of ML applications and throughout the EDA process.

Ali, Ibrahim, Khaled Wassif, and Hanaa Bayomi. 2024. “Dimensionality Reduction for Images of IoT Using Machine Learning.” Scientific Reports 14: 7205. https://doi.org/10.1038/s41598-024-57385-4.
Altman, Naomi, and Martin Krzywinski. 2018. “The Curse(s) of Dimensionality.” Nature Methods 15 (6): 397–400. https://doi.org/10.1038/s41592-018-0019-x.
Amin, R. W., E. M. Yacko, and R. P. Guttmann. 2018. “Geographic Clusters of Alzheimer’s Disease Mortality Rates in the USA: 2008-2012.” Journal of Prevention of Alzheimer’s Disease (JPAD) 3.
Bharadiya, Jasmin Praful. 2023. “A Tutorial on Principal Component Analysis for Dimensionality Reduction in Machine Learning.” International Journal of Innovative Research in Science Engineering and Technology 8 (5): 2028–32. https://doi.org/10.5281/zenodo.8002436.
“Compression of Spectral Data Using Box-Cox Transformation.” 2014. Color Research & Application 39 (2). https://doi.org/10.1002/col.21771.
Goel, Akash, Amit Kumar Goel, and Adesh Kumar. 2023. “The Role of Artificial Neural Network and Machine Learning in Utilizing Spatial Information.” Spatial Information Research 31: 275–85. https://doi.org/10.1007/s41324-022-00494-x.
Harvey, A., and P. Collier. 1977. “Testing for Functional Misspecification in Regression Analysis.” Journal of Econometrics 6: 103–19.
Johnson, Richard, and Dean Wichern. 2023. Applied Multivariate Statistical Analysis. Pearson.
Joshi, Ketaki, and Bhushan Patil. 2020. “Prediction of Surface Roughness by Machine Vision Using Principal Components Based Regression Analysis.” Procedia Computer Science 167: 382–91. https://doi.org/10.1016/j.procs.2020.03.242.
Li, Jun, Saurabh Prasad, James E Fowler, and Lori M Bruce. 2012. “PCA-Based Feature Reduction for Hyperspectral Remote Sensing Image Classification.” IEEE Transactions on Geoscience and Remote Sensing 50 (1): 370–83.
Maureen, Nwakuya Tobechukwu, Biu Emmanuel Oyinebifun, and Ekwe Christopher. 2022. “Investigating Instability of Regression Parameters and Structural Breaks in Nigerian Economic Data from 1984 to 2019.” International Journal of Mathematics Trends and Technology 68 (12): 67–73. https://doi.org/10.14445/22315373/IJMTT-V68I12P509.
Rahayu, S., T. Sugiarto, L. Madu, Holiawati, and A. Subagyo. 2017. “Application of Principal Component Analysis (PCA) to Reduce Multicollinearity Exchange Rate Currency of Some Countries in Asia Period 2004-2014.” International Journal of Educational Methodology 3 (2): 75–83. https://doi.org/10.12973/ijem.3.2.75.
Tejada-Vera, Betzaida. 2013. “Mortality from Alzheimer’s Disease in the United States: Data for 2000 and 2010.” NCHS Data Brief, no. 116: 1–8.